BIL476 - BANKING DATASET ANALYSIS¶
Load the Banking Dataset¶
In this Banking Dataset Analysis project, the dataset used is the "Bank Marketing" dataset from the UCI Machine Learning Repository.
from ucimlrepo import fetch_ucirepo
# fetch dataset
bank_marketing = fetch_ucirepo(id=222)
# data (as pandas dataframes)
X = bank_marketing.data.features
y = bank_marketing.data.targets
# metadata
print(bank_marketing.metadata)
# variable information
display(bank_marketing.variables)
{'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'title': 'A data-driven approach to predict the success of bank telemarketing', 'authors': 'Sérgio Moro, P. Cortez, P. Rita', 'published_in': 'Decision Support Systems', 'year': 2014, 'url': 'https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e', 'doi': '10.1016/j.dss.2014.03.001'}, 'additional_info': {'summary': "The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. \n\nThere are four datasets: \n1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]\n2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.\n3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). 
\n4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). \nThe smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). \n\nThe classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Input variables:\n # bank client data:\n 1 - age (numeric)\n 2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",\n "blue-collar","self-employed","retired","technician","services") \n 3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)\n 4 - education (categorical: "unknown","secondary","primary","tertiary")\n 5 - default: has credit in default? (binary: "yes","no")\n 6 - balance: average yearly balance, in euros (numeric) \n 7 - housing: has housing loan? (binary: "yes","no")\n 8 - loan: has personal loan? 
(binary: "yes","no")\n # related with the last contact of the current campaign:\n 9 - contact: contact communication type (categorical: "unknown","telephone","cellular") \n 10 - day: last contact day of the month (numeric)\n 11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")\n 12 - duration: last contact duration, in seconds (numeric)\n # other attributes:\n 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)\n 15 - previous: number of contacts performed before this campaign and for this client (numeric)\n 16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")\n\n Output variable (desired target):\n 17 - y - has the client subscribed a term deposit? (binary: "yes","no")\n', 'citation': None}}
| name | role | type | demographic | description | units | missing_values | |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | None | None | no |
| 1 | job | Feature | Categorical | Occupation | type of job (categorical: 'admin.','blue-colla... | None | no |
| 2 | marital | Feature | Categorical | Marital Status | marital status (categorical: 'divorced','marri... | None | no |
| 3 | education | Feature | Categorical | Education Level | (categorical: 'basic.4y','basic.6y','basic.9y'... | None | no |
| 4 | default | Feature | Binary | None | has credit in default? | None | no |
| 5 | balance | Feature | Integer | None | average yearly balance | euros | no |
| 6 | housing | Feature | Binary | None | has housing loan? | None | no |
| 7 | loan | Feature | Binary | None | has personal loan? | None | no |
| 8 | contact | Feature | Categorical | None | contact communication type (categorical: 'cell... | None | yes |
| 9 | day_of_week | Feature | Date | None | last contact day of the week | None | no |
| 10 | month | Feature | Date | None | last contact month of year (categorical: 'jan'... | None | no |
| 11 | duration | Feature | Integer | None | last contact duration, in seconds (numeric). ... | None | no |
| 12 | campaign | Feature | Integer | None | number of contacts performed during this campa... | None | no |
| 13 | pdays | Feature | Integer | None | number of days that passed by after the client... | None | yes |
| 14 | previous | Feature | Integer | None | number of contacts performed before this campa... | None | no |
| 15 | poutcome | Feature | Categorical | None | outcome of the previous marketing campaign (ca... | None | yes |
| 16 | y | Target | Binary | None | has the client subscribed a term deposit? | None | no |
Attributes in Data
display(X)
| age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN |
| 3 | 47 | blue-collar | married | NaN | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN |
| 4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45206 | 51 | technician | married | tertiary | no | 825 | no | no | cellular | 17 | nov | 977 | 3 | -1 | 0 | NaN |
| 45207 | 71 | retired | divorced | primary | no | 1729 | no | no | cellular | 17 | nov | 456 | 2 | -1 | 0 | NaN |
| 45208 | 72 | retired | married | secondary | no | 5715 | no | no | cellular | 17 | nov | 1127 | 5 | 184 | 3 | success |
| 45209 | 57 | blue-collar | married | secondary | no | 668 | no | no | telephone | 17 | nov | 508 | 4 | -1 | 0 | NaN |
| 45210 | 37 | entrepreneur | married | secondary | no | 2971 | no | no | cellular | 17 | nov | 361 | 2 | 188 | 11 | other |
45211 rows × 16 columns
The Target Value in Data
display(y)
| y | |
|---|---|
| 0 | no |
| 1 | no |
| 2 | no |
| 3 | no |
| 4 | no |
| ... | ... |
| 45206 | yes |
| 45207 | yes |
| 45208 | yes |
| 45209 | no |
| 45210 | no |
45211 rows × 1 columns
Concatenate the Attributes and the Target Value for Common Processing Steps
import pandas as pd
df = pd.concat([X, y], axis=1)
display(df)
| age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN | no |
| 3 | 47 | blue-collar | married | NaN | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN | no |
| 4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45206 | 51 | technician | married | tertiary | no | 825 | no | no | cellular | 17 | nov | 977 | 3 | -1 | 0 | NaN | yes |
| 45207 | 71 | retired | divorced | primary | no | 1729 | no | no | cellular | 17 | nov | 456 | 2 | -1 | 0 | NaN | yes |
| 45208 | 72 | retired | married | secondary | no | 5715 | no | no | cellular | 17 | nov | 1127 | 5 | 184 | 3 | success | yes |
| 45209 | 57 | blue-collar | married | secondary | no | 668 | no | no | telephone | 17 | nov | 508 | 4 | -1 | 0 | NaN | no |
| 45210 | 37 | entrepreneur | married | secondary | no | 2971 | no | no | cellular | 17 | nov | 361 | 2 | 188 | 11 | other | no |
45211 rows × 17 columns
EDA¶
Check Missing Values
null_values = df.isnull().sum()
null_values_df = pd.DataFrame(null_values, columns=['Total Null Values'])
print("TOTAL NULL VALUES PER ATTRIBUTE:")
display(null_values_df)
TOTAL NULL VALUES PER ATTRIBUTE:
| Total Null Values | |
|---|---|
| age | 0 |
| job | 288 |
| marital | 0 |
| education | 1857 |
| default | 0 |
| balance | 0 |
| housing | 0 |
| loan | 0 |
| contact | 13020 |
| day_of_week | 0 |
| month | 0 |
| duration | 0 |
| campaign | 0 |
| pdays | 0 |
| previous | 0 |
| poutcome | 36959 |
| y | 0 |
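The absolute counts above are easier to judge as percentages of the 45,211 rows: `poutcome` is missing in roughly 82% of them and `contact` in about 29%. A one-liner gives the per-column fraction directly; a minimal sketch on a toy frame (in the notebook, `df.isnull().mean() * 100` would be applied to `df` itself):

```python
import pandas as pd

# Toy frame standing in for df; the same one-liner applies to the full dataset.
toy = pd.DataFrame({'a': [1, None, None, 4], 'b': [1, 2, 3, 4]})
pct_missing = toy.isnull().mean() * 100
print(pct_missing['a'])  # 50.0
```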
Descriptive Statistics of Data
import pandas as pd
descriptive_stats = pd.DataFrame(df.describe())
descriptive_stats = descriptive_stats.style.set_caption("Descriptive Statistics")
print("Descriptive Statistics")
display( descriptive_stats)
Descriptive Statistics
|  | age | balance | day_of_week | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 | 45211.000000 |
| mean | 40.936210 | 1362.272058 | 15.806419 | 258.163080 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.000000 | -8019.000000 | 1.000000 | 0.000000 | 1.000000 | -1.000000 | 0.000000 |
| 25% | 33.000000 | 72.000000 | 8.000000 | 103.000000 | 1.000000 | -1.000000 | 0.000000 |
| 50% | 39.000000 | 448.000000 | 16.000000 | 180.000000 | 2.000000 | -1.000000 | 0.000000 |
| 75% | 48.000000 | 1428.000000 | 21.000000 | 319.000000 | 3.000000 | -1.000000 | 0.000000 |
| max | 95.000000 | 102127.000000 | 31.000000 | 4918.000000 | 63.000000 | 871.000000 | 275.000000 |
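In the table above, the maxima of balance, duration, and campaign sit far above their 75th percentiles, which suggests heavy right tails. Pandas' `skew()` quantifies this; a sketch on a toy series (in the notebook this would be, e.g., `df['balance'].skew()`):

```python
import pandas as pd

# A right-skewed toy sample; a skewness value > 1 indicates strong positive skew.
toy = pd.Series([1, 2, 2, 3, 3, 4, 100])
print(toy.skew() > 1)  # True
```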
Mode Values Per Attribute and Target Value
modes = df.mode()
modes = modes.style.set_caption("Mode Values")
print("Modes")
display(modes)
Modes
|  | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | blue-collar | married | secondary | no | 0 | yes | no | cellular | 20 | may | 124 | 1 | -1 | 0 | failure | no |
HISTOGRAMS & PAIR PLOTS¶
Histogram Distributions Of Attributes
import matplotlib.pyplot as plt
import seaborn as sns
def plot_histograms(data, plot_params, cols=4):
    # Calculate the number of rows needed
    num_plots = len(plot_params)
    rows = (num_plots // cols) + (num_plots % cols > 0)
    # Create subplots
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 5, rows * 3))
    # Flatten axes array for easy iteration
    axes = axes.flatten()
    for i, (column, params) in enumerate(plot_params.items()):
        rotation = params[2] if len(params) == 3 else 0
        sns.histplot(data=data, x=column, bins=20, kde=True, ax=axes[i])
        axes[i].set_title(params[0])
        axes[i].set_xlabel(params[1])
        axes[i].set_ylabel('Count')
        if rotation:
            # plt.setp avoids the FixedLocator warning raised by set_xticklabels
            plt.setp(axes[i].get_xticklabels(), rotation=rotation, ha='right')
    # Remove any unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    plt.tight_layout()
    plt.show()
# Dictionary to store plot parameters for each column
plot_params = {
    'age': ('Distribution of Customer Ages', 'Age'),
    'job': ('Distribution of Customer Jobs', 'Job', 45),
    'marital': ('Distribution of Customer Marital Status', 'Marital'),
    'education': ('Distribution of Customer Education Levels', 'Education Level'),
    'default': ('Presence of Outstanding Debt Distribution', 'Outstanding Debt'),
    'balance': ('Distribution of Average Yearly Balance', 'Balance (EUR)'),
    'housing': ('Presence of Housing Loan Distribution', 'Housing Loan'),
    'loan': ('Presence of Personal Loan Distribution', 'Personal Loan'),
    'contact': ('Distribution of Communication Channels', 'Communication Channels'),
    'day_of_week': ('Distribution of the Last Contact Day of the Month', 'Day'),
    'month': ('Distribution of the Last Contact Month of the Year', 'Month'),
    'duration': ('Distribution of Contact Duration in Seconds', 'Duration (s)'),
    'campaign': ('Distribution of the Number of Contacts', 'Contacts'),
    'pdays': ('Distribution of the Number of Days Since the Last Contact', 'Days'),
    'previous': ('Distribution of the Number of Contacts Before the Campaign', 'Contacts'),
    'poutcome': ('Distribution of the Results of the Previous Campaign', 'Results')
}
# Plot histograms in a grid layout
plot_histograms(X, plot_params)
Histogram Distribution of the Target Value
plt.figure(figsize=(5, 3))
sns.histplot(data=y, x='y', bins=20, kde=True)
plt.title('Distribution of Target Value')
plt.xlabel('Target Value')
plt.ylabel('Count')
plt.show()
Pair Plots
print("PAIR PLOTS OF ATTRIBUTES\n")
sns.pairplot(df, diag_kind='kde', hue='y')
plt.suptitle('PAIR PLOTS OF ATTRIBUTES', y=1.02)
plt.show()
PAIR PLOTS OF ATTRIBUTES
BOXPLOT GENERATION¶
As the definitions and distributions of some attributes suggest, a boxplot is not meaningful for every column. Even so, boxplots were generated for every attribute so that nothing is missed.
import matplotlib.pyplot as plt
import seaborn as sns
# Get the list of all columns in the X
columns = X.columns
# Number of columns for the subplot grid
cols = 4 # Adjust this number based on how many columns you want per row
rows = (len(columns) // cols) + (len(columns) % cols > 0)
# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(cols * 5, rows * 4))
# Flatten axes array for easy iteration
axes = axes.flatten()
for i, column in enumerate(columns):
    sns.boxplot(x=X[column], ax=axes[i])
    axes[i].set_title(f'Boxplot of {column.capitalize()}')
    axes[i].set_xlabel(column.capitalize())
# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
Correlation Heatmap¶
df_encoded = pd.get_dummies(df, drop_first=True)
# Check for and handle missing values
df_encoded = df_encoded.dropna()
# Calculate the correlation matrix
corr_matrix = df_encoded.corr()
# Generate the heatmap
plt.figure(figsize=(20, 15))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
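The full heatmap is dense; it is often handier to rank features by their absolute correlation with the target dummy (e.g. `y_yes` after `get_dummies`). A sketch with an illustrative helper `top_correlations` on a toy numeric frame:

```python
import pandas as pd

def top_correlations(frame: pd.DataFrame, target: str, n: int = 5) -> pd.Series:
    """Return the n columns most correlated (in absolute value) with `target`."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index).head(n)

# Toy numeric frame; in the notebook, df_encoded and its target dummy column
# would be passed instead.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [4, 3, 3, 1],
    't': [1, 2, 3, 5],
})
ranked = top_correlations(toy, 't')
print(ranked.index.tolist())  # ['a', 'b']
```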
Check & Handle Outliers
import numpy as np
u, ind = np.unique(y, return_inverse=True)
plt.scatter(ind, df['balance'])
plt.show()
plt.scatter( df['duration'], df['balance'])
plt.scatter(df['age'], df['balance'])
X = X[X['balance'] <= 55000]  # drop rows with extreme balance values
plt.figure(figsize=(4, 4))
sns.boxplot(df['balance'])
plt.show()
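Dropping rows above a hand-picked threshold is one option; an alternative that keeps every row is to cap (winsorize) values at the Tukey fences `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`. A sketch with a hypothetical helper `cap_outliers`:

```python
import pandas as pd

def cap_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy series with one extreme value; in the notebook this could be df['balance'].
toy = pd.Series([10, 20, 30, 40, 100000])
capped = cap_outliers(toy)
print(int(capped.max()))  # 70
```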
Education - Job¶
sns.boxplot(data=df, x="education", y="job")
NaN Value Check
import pandas as pd
nan_counts = {column: df[column].isna().sum() for column in df.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])
display(nan_counts_df)
| age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NaN Counts | 0 | 288 | 0 | 1857 | 0 | 0 | 0 | 0 | 13020 | 0 | 0 | 0 | 0 | 0 | 0 | 36959 | 0 |
Fill the Null Values in Education
X.loc[(X['education'].isnull()) & (X['job'] == 'admin.'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'blue-collar'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'entrepreneur'), 'education'] = 'tertiary'
X.loc[(X['education'].isnull()) & (X['job'] == 'housemaid'), 'education'] = 'primary'
X.loc[(X['education'].isnull()) & (X['job'] == 'management'), 'education'] = 'tertiary'
X.loc[(X['education'].isnull()) & (X['job'] == 'retired'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'self-employed'), 'education'] = 'tertiary'
X.loc[(X['education'].isnull()) & (X['job'] == 'services'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'student'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'technician'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'unemployed'), 'education'] = 'secondary'
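The eleven `.loc` assignments above hard-code each job's typical education level. The same mapping can also be derived from the data by taking the per-job mode, which stays correct if the data changes. A sketch with a hypothetical helper `fill_education_by_job` on a toy frame:

```python
import pandas as pd

def fill_education_by_job(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing 'education' with the most common value for each 'job'."""
    out = frame.copy()
    mode_by_job = out.groupby('job')['education'].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else None
    )
    mask = out['education'].isna() & out['job'].notna()
    out.loc[mask, 'education'] = out.loc[mask, 'job'].map(mode_by_job)
    return out

toy = pd.DataFrame({
    'job': ['admin.', 'admin.', 'admin.', 'housemaid'],
    'education': ['secondary', 'secondary', None, 'primary'],
})
filled = fill_education_by_job(toy)
print(filled['education'].tolist())  # ['secondary', 'secondary', 'secondary', 'primary']
```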
Null Value Check After Filling in the 'education' Column According to the 'education'-'job' Relationship
job_counts = df['job'].value_counts(dropna=False)
plt.figure(figsize = (6, 6))
plt.pie(job_counts, labels=job_counts.index, autopct='%1.1f%%')
education_counts = df['education'].value_counts(dropna=False)
plt.figure(figsize = (6, 6))
plt.pie(education_counts, labels=education_counts.index, autopct='%1.1f%%')
poutcome_counts = df['poutcome'].value_counts(dropna=False)
plt.figure(figsize = (6, 6))
plt.pie(poutcome_counts, labels=poutcome_counts.index, autopct='%1.1f%%')
plt.show()
pdays_counts = df['pdays'].value_counts(dropna=False)
plt.figure(figsize=(6, 6))
plt.pie(pdays_counts, labels=pdays_counts.index, autopct='%1.1f%%')
plt.show()
nan_counts = {column: X[column].isna().sum() for column in X.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])
display(nan_counts_df)
| age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NaN Counts | 0 | 288 | 0 | 127 | 0 | 0 | 0 | 0 | 13018 | 0 | 0 | 0 | 0 | 0 | 0 | 36948 |
df = df.drop(columns=['contact', 'poutcome'])
nan_counts = {column: df[column].isna().sum() for column in df.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])
display(nan_counts_df)
| age | job | marital | education | default | balance | housing | loan | day_of_week | month | duration | campaign | pdays | previous | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NaN Counts | 0 | 288 | 0 | 1857 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
df = df.dropna()
nan_counts = {column: df[column].isna().sum() for column in df.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])
display(nan_counts_df)
| age | job | marital | education | default | balance | housing | loan | day_of_week | month | duration | campaign | pdays | previous | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NaN Counts | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
display(df)
| age | job | marital | education | default | balance | housing | loan | day_of_week | month | duration | campaign | pdays | previous | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | 5 | may | 261 | 1 | -1 | 0 | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | 5 | may | 151 | 1 | -1 | 0 | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | 5 | may | 76 | 1 | -1 | 0 | no |
| 5 | 35 | management | married | tertiary | no | 231 | yes | no | 5 | may | 139 | 1 | -1 | 0 | no |
| 6 | 28 | management | single | tertiary | no | 447 | yes | yes | 5 | may | 217 | 1 | -1 | 0 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45206 | 51 | technician | married | tertiary | no | 825 | no | no | 17 | nov | 977 | 3 | -1 | 0 | yes |
| 45207 | 71 | retired | divorced | primary | no | 1729 | no | no | 17 | nov | 456 | 2 | -1 | 0 | yes |
| 45208 | 72 | retired | married | secondary | no | 5715 | no | no | 17 | nov | 1127 | 5 | 184 | 3 | yes |
| 45209 | 57 | blue-collar | married | secondary | no | 668 | no | no | 17 | nov | 508 | 4 | -1 | 0 | no |
| 45210 | 37 | entrepreneur | married | secondary | no | 2971 | no | no | 17 | nov | 361 | 2 | 188 | 11 | no |
43193 rows × 15 columns
min_value = df['day_of_week'].min()
max_value = df['day_of_week'].max()
min_max_df = pd.DataFrame({
    'Statistic': ['Minimum value', 'Maximum value'],
    'day_of_week Value': [min_value, max_value]
})
display(min_max_df)
| Statistic | day_of_week Value | |
|---|---|---|
| 0 | Minimum value | 1 |
| 1 | Maximum value | 31 |
job_value_counts = df['job'].value_counts()
job_counts_df = pd.DataFrame(list(job_value_counts.items()), columns=['Job', 'Counts'])
display(job_counts_df.T)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Job | blue-collar | management | technician | admin. | services | retired | self-employed | entrepreneur | unemployed | housemaid | student |
| Counts | 9278 | 9216 | 7355 | 5000 | 4004 | 2145 | 1540 | 1411 | 1274 | 1195 | 775 |
categorical_columns = df.select_dtypes(include=['object']).columns
print("Columns with categorical data:")
categorical = pd.DataFrame(categorical_columns, columns=['Categorical Columns'])
display(categorical.T)
Columns with categorical data:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| Categorical Columns | job | marital | education | default | housing | loan | month | y |
mon_value = df['month'].unique()
mon_value_df = pd.DataFrame(list(mon_value), columns=['Month'])
display(mon_value_df.T)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Month | may | jun | jul | aug | oct | nov | dec | jan | feb | mar | apr | sep |
ENCODING¶
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
for column in ['job', 'marital', 'education', 'default', 'housing', 'loan']:
    df[column] = encoder.fit_transform(df[column])
display(X.head())
| age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN |
| 3 | 47 | blue-collar | married | secondary | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN |
| 4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN |
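Note that `LabelEncoder` assigns integer codes in alphabetical order of the categories, and refitting the same encoder on each column discards the previous column's mapping; storing one fitted encoder per column (e.g. in a dict) keeps decoding possible. The ordering can be seen on a small example:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['married', 'single', 'divorced', 'married'])

# Categories are sorted alphabetically, then numbered from 0.
print(enc.classes_.tolist())  # ['divorced', 'married', 'single']
print(codes.tolist())         # [1, 2, 0, 1]
```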
month_mapping = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
df['month'] = df['month'].map(month_mapping)
df = df.sample(frac=1).reset_index(drop=True)
display(df.head())
| age | job | marital | education | default | balance | housing | loan | day_of_week | month | duration | campaign | pdays | previous | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | 4 | 1 | 2 | 0 | 877 | 1 | 0 | 22 | 7 | 611 | 1 | -1 | 0 | no |
| 1 | 46 | 4 | 2 | 2 | 0 | -867 | 1 | 1 | 21 | 5 | 222 | 5 | -1 | 0 | no |
| 2 | 30 | 4 | 2 | 2 | 0 | 19796 | 0 | 0 | 20 | 11 | 41 | 1 | -1 | 0 | no |
| 3 | 33 | 1 | 2 | 1 | 0 | 56 | 1 | 0 | 29 | 1 | 290 | 4 | -1 | 0 | no |
| 4 | 43 | 9 | 2 | 1 | 0 | 336 | 1 | 0 | 9 | 5 | 181 | 1 | -1 | 0 | no |
df['y'] = df['y'].map({'no': 0, 'yes': 1})
display(df.head())
| age | job | marital | education | default | balance | housing | loan | day_of_week | month | duration | campaign | pdays | previous | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | 4 | 1 | 2 | 0 | 877 | 1 | 0 | 22 | 7 | 611 | 1 | -1 | 0 | 0 |
| 1 | 46 | 4 | 2 | 2 | 0 | -867 | 1 | 1 | 21 | 5 | 222 | 5 | -1 | 0 | 0 |
| 2 | 30 | 4 | 2 | 2 | 0 | 19796 | 0 | 0 | 20 | 11 | 41 | 1 | -1 | 0 | 0 |
| 3 | 33 | 1 | 2 | 1 | 0 | 56 | 1 | 0 | 29 | 1 | 290 | 4 | -1 | 0 | 0 |
| 4 | 43 | 9 | 2 | 1 | 0 | 336 | 1 | 0 | 9 | 5 | 181 | 1 | -1 | 0 | 0 |
NORMALIZATION - ROBUST SCALING¶
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df = scaler.fit_transform(df)
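`RobustScaler` here is fitted on the whole array, including the binary target (which happens to pass through unchanged because its median and IQR are both 0). A cleaner pattern is to scale only the feature columns and leave `y` as 0/1; a minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame with a binary target column, standing in for df.
toy = pd.DataFrame({'balance': [0.0, 100.0, 200.0, 10000.0],
                    'duration': [10.0, 20.0, 30.0, 40.0],
                    'y': [0, 1, 0, 1]})

features = toy.drop(columns=['y'])
X_scaled = RobustScaler().fit_transform(features)  # scale features only
y_vals = toy['y'].to_numpy()                       # leave the target as 0/1

print(sorted(set(y_vals.tolist())))  # [0, 1]
```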
from sklearn.model_selection import train_test_split
X_features = df[:, :-1]
y_target = df[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.2, random_state=2)
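Since only about 12% of the labels are 'yes', a plain random split can drift from that ratio; passing `stratify` keeps the class proportions identical in train and test. A sketch on a toy imbalanced target, reusing the notebook's `test_size=0.2` and `random_state=2`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 negatives, 10 positives (similar skew to the dataset).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2, stratify=y_toy
)
print(int(y_te.sum()))  # 2, i.e. exactly 10% positives in the 20-row test split
```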
X_df = pd.DataFrame(X_features)
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2,
            init='k-means++',
            n_init=10,
            max_iter=100,
            random_state=42)
clusters_predict = km.fit_predict(X_df)
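A quick sanity check for the two clusters is the silhouette score: values near 1 mean tight, well-separated clusters, values near 0 mean overlap. A sketch on toy blobs; in the notebook, `silhouette_score(X_df, clusters_predict)` would apply directly (though it can be slow on 40k+ rows):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs in 2D.
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(pts)
print(silhouette_score(pts, labels) > 0.9)  # True: tight, distinct clusters
```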
# Calculate the principal components in 2D and 3D
import prince
import plotly.express as px
def get_pca_2d(df, predict):
    pca_2d_object = prince.PCA(
        n_components=2,
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )
    pca_2d_object.fit(df)
    df_pca_2d = pca_2d_object.transform(df)
    df_pca_2d.columns = ["comp1", "comp2"]
    df_pca_2d["cluster"] = predict
    return pca_2d_object, df_pca_2d
def get_pca_3d(df, predict):
    pca_3d_object = prince.PCA(
        n_components=3,
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )
    pca_3d_object.fit(df)
    df_pca_3d = pca_3d_object.transform(df)
    df_pca_3d.columns = ["comp1", "comp2", "comp3"]
    df_pca_3d["cluster"] = predict
    return pca_3d_object, df_pca_3d
def plot_pca_3d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")
    columns = df.columns[0:3].tolist()
    fig = px.scatter_3d(df,
                        x=columns[0],
                        y=columns[1],
                        z=columns[2],
                        color='cluster',
                        template="plotly",
                        color_discrete_sequence=px.colors.qualitative.Vivid,
                        title=title
    ).update_traces(
        marker={
            "size": 4,
            "opacity": opacity,
            "line": {
                "width": width_line,
                "color": "black",
            }
        }
    ).update_layout(
        width=1000,
        height=800,
        autosize=False,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman",
                    font=dict(size=20)),
        scene=dict(xaxis=dict(title='comp1', titlefont_color='black'),
                   yaxis=dict(title='comp2', titlefont_color='black'),
                   zaxis=dict(title='comp3', titlefont_color='black')),
        font=dict(family="Gilroy", color='black', size=15))
    fig.show()
pca_3d_object, df_pca_3d = get_pca_3d(X_df, clusters_predict)
plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=1, width_line = 0.1)
print("The variability is :", pca_3d_object.eigenvalues_summary)
The variability is :

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 1.653 | 11.81% | 11.81% |
| 1 | 1.510 | 10.78% | 22.59% |
| 2 | 1.355 | 9.68% | 32.28% |
def plot_pca_2d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")
    columns = df.columns[0:2].tolist()
    fig = px.scatter(df,
                     x=columns[0],
                     y=columns[1],
                     color='cluster',
                     template="plotly",
                     color_discrete_sequence=px.colors.qualitative.Vivid,
                     title=title
    ).update_traces(
        marker={
            "size": 8,
            "opacity": opacity,
            "line": {
                "width": width_line,
                "color": "black",
            }
        }
    ).update_layout(
        width=800,
        height=700,
        autosize=False,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman",
                    font=dict(size=20)),
        scene=dict(xaxis=dict(title='comp1', titlefont_color='black'),
                   yaxis=dict(title='comp2', titlefont_color='black')),
        font=dict(family="Gilroy", color='black', size=15))
    fig.show()
pca_2d_object, df_pca_2d = get_pca_2d(X_df, clusters_predict)
plot_pca_2d(df_pca_2d, title = "PCA Space", opacity=1, width_line = 0.5)
MODELLING¶
Machine Learning Models¶
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train) # No need to scale features for Random Forest
lr = LogisticRegression(max_iter=500 , random_state = 2)
lr.fit(X_train,y_train)
LogisticRegression(max_iter=500, random_state=2)
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
rf_predictions = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_recall = recall_score(y_test, rf_predictions, average='macro')
rf_precision = precision_score(y_test, rf_predictions, average='macro')
rf_f1 = f1_score(y_test, rf_predictions, average='macro')
lr_predictions = lr.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)
lr_recall = recall_score(y_test, lr_predictions, average='macro')
lr_precision = precision_score(y_test, lr_predictions, average='macro')
lr_f1 = f1_score(y_test, lr_predictions, average='macro')
Confusion Matrices & Classification Report
from sklearn.metrics import classification_report, confusion_matrix
rf_conf_matrix = confusion_matrix(y_test, rf_predictions)
rf_class_report = classification_report(y_test, rf_predictions, output_dict=True)
rf_report_df = pd.DataFrame(rf_class_report).transpose()
print("Random Forest Confusion Matrix:")
display(pd.DataFrame(rf_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Random Forest Classification Report:")
display(rf_report_df)
lr_conf_matrix = confusion_matrix(y_test, lr_predictions)
lr_class_report = classification_report(y_test, lr_predictions, output_dict=True)
lr_report_df = pd.DataFrame(lr_class_report).transpose()
print("Logistic Regression Confusion Matrix:")
display(pd.DataFrame(lr_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Logistic Regression Classification Report:")
display(lr_report_df)
Random Forest Confusion Matrix:
| 0.0 | 1.0 | |
|---|---|---|
| 0.0 | 7469 | 227 |
| 1.0 | 573 | 370 |
Random Forest Classification Report:
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| 0.0 | 0.928749 | 0.970504 | 0.949168 | 7696.000000 |
| 1.0 | 0.619765 | 0.392365 | 0.480519 | 943.000000 |
| accuracy | 0.907397 | 0.907397 | 0.907397 | 0.907397 |
| macro avg | 0.774257 | 0.681434 | 0.714844 | 8639.000000 |
| weighted avg | 0.895022 | 0.907397 | 0.898012 | 8639.000000 |
Logistic Regression Confusion Matrix:
| | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 7548 | 148 |
| 1.0 | 750 | 193 |
Logistic Regression Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.909617 | 0.980769 | 0.943854 | 7696.000000 |
| 1.0 | 0.565982 | 0.204666 | 0.300623 | 943.000000 |
| accuracy | 0.896053 | 0.896053 | 0.896053 | 0.896053 |
| macro avg | 0.737800 | 0.592718 | 0.622238 | 8639.000000 |
| weighted avg | 0.872107 | 0.896053 | 0.873641 | 8639.000000 |
results = pd.DataFrame({
'Metric': ['Accuracy', 'Recall', 'Precision', 'F1 Score'],
'Random Forest': [rf_accuracy, rf_recall, rf_precision, rf_f1],
'Logistic Regression': [lr_accuracy, lr_recall, lr_precision, lr_f1]
})
# Display the table
display(results)
| | Metric | Random Forest | Logistic Regression |
|---|---|---|---|
| 0 | Accuracy | 0.907397 | 0.896053 |
| 1 | Recall | 0.681434 | 0.592718 |
| 2 | Precision | 0.774257 | 0.737800 |
| 3 | F1 Score | 0.714844 | 0.622238 |
ROC - AUC Curves
from sklearn.metrics import roc_curve, roc_auc_score
# Note: passing hard 0/1 predictions yields a single-point ROC curve;
# predict_proba scores (used in the ROC Curves section below) give the full curve.
rf_auc = roc_auc_score(y_test, rf_predictions)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, rf_predictions)
print('ROC_AUC_SCORE for Random Forest is', rf_auc)
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve for Random Forest')
plt.show()
lr_auc = roc_auc_score(y_test, lr_predictions)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, lr_predictions)
print('ROC_AUC_SCORE for Logistic Regression is', lr_auc)
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve for Logistic Regression')
plt.show()
ROC_AUC_SCORE for Random Forest is 0.6814344756086538
ROC_AUC_SCORE for Logistic Regression is 0.592717595236153
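The AUC values above exactly match the macro recall values because `roc_curve` was given hard 0/1 predictions: a single threshold collapses the curve to one operating point, and the resulting AUC equals the balanced accuracy. A sketch on synthetic scores (illustrative data, not the bank dataset) showing the difference when continuous probabilities are used:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary problem (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.35, 0.8, 0.6, 0.4, 0.9, 0.2])

# Hard labels give only one threshold, so the "curve" is just three points
hard = (scores >= 0.5).astype(int)
fpr_hard, tpr_hard, _ = roc_curve(y_true, hard)
print(len(fpr_hard))  # 3

# Continuous scores trace the full curve; here every positive outscores
# every negative, so the AUC is 1.0
print(roc_auc_score(y_true, scores))  # 1.0
```

This is why the later ROC Curves section, which uses `predict_proba`, is the more informative comparison.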
Grid Search & Cross-Validation for Hyperparameter Tuning
n_estimators = [int(x) for x in np.linspace(start=100, stop=300, num=3)]
max_features = ['sqrt']  # number of features to consider at every split
max_depth = [int(x) for x in np.linspace(5, 10, num=2)]  # maximum number of levels in the tree
max_depth.append(None)
min_samples_split = [2, 5]  # minimum number of samples required to split a node
min_samples_leaf = [1, 2]  # minimum number of samples required at each leaf node
criterion = ['gini', 'entropy']
rf_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'criterion': criterion}
print("Grid Parameters for Random Forest")
display((rf_grid))
penalty = ["l2"] # norm of the penalty
C = [0.001, 0.01, 0.1, 1, 10] # inverse of regularization strength
lr_grid = {"penalty": penalty,
"C": C}
print("Grid Parameters for Logistic Regression")
display((lr_grid))
Grid Parameters for Random Forest
{'n_estimators': [100, 200, 300],
'max_features': ['sqrt'],
'max_depth': [5, 10, None],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2],
'criterion': ['gini', 'entropy']}
Grid Parameters for Logistic Regression
{'penalty': ['l2'], 'C': [0.001, 0.01, 0.1, 1, 10]}
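Before running the search, it can be useful to estimate how many models it will train: GridSearchCV fits every parameter combination once per cross-validation fold. A small sketch for the Random Forest grid above:

```python
# The grid above trains 3 * 1 * 3 * 2 * 2 * 2 = 72 combinations,
# each fitted once per fold
rf_grid = {'n_estimators': [100, 200, 300],
           'max_features': ['sqrt'],
           'max_depth': [5, 10, None],
           'min_samples_split': [2, 5],
           'min_samples_leaf': [1, 2],
           'criterion': ['gini', 'entropy']}

n_combos = 1
for values in rf_grid.values():
    n_combos *= len(values)
print(n_combos)      # 72
print(n_combos * 5)  # 360 fits with cv=5 (plus one final refit)
```

Keeping the grid this coarse is a deliberate trade-off: a finer grid multiplies the fit count quickly.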
from sklearn.model_selection import GridSearchCV
rf_grid_search = GridSearchCV(RandomForestClassifier(), rf_grid, cv=5, scoring='accuracy', n_jobs=-1)
rf_grid_search.fit(X_train, y_train)
best_rf = rf_grid_search.best_estimator_
print("Best Random Forest Parameters:", rf_grid_search.best_params_)
lr_grid_search = GridSearchCV(LogisticRegression(max_iter=500), lr_grid, cv=5, scoring='accuracy', n_jobs=-1)
lr_grid_search.fit(X_train, y_train)
best_lr = lr_grid_search.best_estimator_
print("Best Logistic Regression Parameters:", lr_grid_search.best_params_)
Best Random Forest Parameters: {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best Logistic Regression Parameters: {'C': 0.01, 'penalty': 'l2'}
# Note: GridSearchCV with the default refit=True already refits best_estimator_
# on the full training set; refitting here is only for explicitness.
rf_params = rf_grid_search.best_params_
best_rf = RandomForestClassifier(**rf_params)
best_rf.fit(X_train, y_train)
lr_params = lr_grid_search.best_params_
best_lr = LogisticRegression(**lr_params, max_iter=500)
best_lr.fit(X_train, y_train)
LogisticRegression(C=0.01, max_iter=500)
rf_predictions_grid = best_rf.predict(X_test)
rf_conf_matrix_grid = confusion_matrix(y_test, rf_predictions_grid)
rf_class_report_grid = classification_report(y_test, rf_predictions_grid, output_dict=True)
rf_report_df_grid = pd.DataFrame(rf_class_report_grid).transpose()
rf_accuracy_grid = accuracy_score(y_test, rf_predictions_grid)
rf_recall_grid = recall_score(y_test, rf_predictions_grid, average='macro')
rf_precision_grid = precision_score(y_test, rf_predictions_grid, average='macro')
rf_f1_grid = f1_score(y_test, rf_predictions_grid, average='macro')
print("Random Forest Accuracy Using Grid Search Parameters:", rf_accuracy_grid)
print("Random Forest Confusion Matrix Using Grid Search Parameters:")
display(pd.DataFrame(rf_conf_matrix_grid, columns=np.unique(y_test), index=np.unique(y_test)))
print("Random Forest Classification Report Using Grid Search Parameters:")
display(rf_report_df_grid)
lr_predictions_grid = best_lr.predict(X_test)
lr_conf_matrix_grid = confusion_matrix(y_test, lr_predictions_grid)
lr_class_report_grid = classification_report(y_test, lr_predictions_grid, output_dict=True)
lr_report_df_grid = pd.DataFrame(lr_class_report_grid).transpose()
lr_accuracy_grid = accuracy_score(y_test, lr_predictions_grid)
lr_recall_grid = recall_score(y_test, lr_predictions_grid, average='macro')
lr_precision_grid = precision_score(y_test, lr_predictions_grid, average='macro')
lr_f1_grid = f1_score(y_test, lr_predictions_grid, average='macro')
print("Logistic Regression Accuracy Using Grid Search Parameters:", lr_accuracy_grid)
print("Logistic Regression Confusion Matrix Using Grid Search Parameters:")
display(pd.DataFrame(lr_conf_matrix_grid, columns=np.unique(y_test), index=np.unique(y_test)))
print("Logistic Regression Classification Report Using Grid Search Parameters:")
display(lr_report_df_grid)
Random Forest Accuracy Using Grid Search Parameters: 0.9071651811552263
Random Forest Confusion Matrix Using Grid Search Parameters:
| | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 7477 | 219 |
| 1.0 | 583 | 360 |
Random Forest Classification Report Using Grid Search Parameters:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.927667 | 0.971544 | 0.949099 | 7696.000000 |
| 1.0 | 0.621762 | 0.381760 | 0.473062 | 943.000000 |
| accuracy | 0.907165 | 0.907165 | 0.907165 | 0.907165 |
| macro avg | 0.774715 | 0.676652 | 0.711080 | 8639.000000 |
| weighted avg | 0.894276 | 0.907165 | 0.897136 | 8639.000000 |
Logistic Regression Accuracy Using Grid Search Parameters: 0.8958212756106031
Logistic Regression Confusion Matrix Using Grid Search Parameters:
| | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 7561 | 135 |
| 1.0 | 765 | 178 |
Logistic Regression Classification Report Using Grid Search Parameters:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.908119 | 0.982458 | 0.943827 | 7696.000000 |
| 1.0 | 0.568690 | 0.188759 | 0.283439 | 943.000000 |
| accuracy | 0.895821 | 0.895821 | 0.895821 | 0.895821 |
| macro avg | 0.738405 | 0.585609 | 0.613633 | 8639.000000 |
| weighted avg | 0.871068 | 0.895821 | 0.871742 | 8639.000000 |
results = pd.DataFrame({
'Metric': ['Accuracy', 'Recall', 'Precision', 'F1 Score'],
'Random Forest': [rf_accuracy, rf_recall, rf_precision, rf_f1],
'Logistic Regression': [lr_accuracy, lr_recall, lr_precision, lr_f1],
'Random Forest Using Grid Search Parameters':[rf_accuracy_grid, rf_recall_grid, rf_precision_grid, rf_f1_grid],
'Logistic Regression Grid Search Parameters': [lr_accuracy_grid, lr_recall_grid, lr_precision_grid, lr_f1_grid]
})
# Display the table
display(results)
| | Metric | Random Forest | Logistic Regression | Random Forest Using Grid Search Parameters | Logistic Regression Grid Search Parameters |
|---|---|---|---|---|---|
| 0 | Accuracy | 0.907397 | 0.896053 | 0.907165 | 0.895821 |
| 1 | Recall | 0.681434 | 0.592718 | 0.676652 | 0.585609 |
| 2 | Precision | 0.774257 | 0.737800 | 0.774715 | 0.738405 |
| 3 | F1 Score | 0.714844 | 0.622238 | 0.711080 | 0.613633 |
Improve Model Performance Using Ensemble Methods
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
# Boosting with AdaBoost using the tuned Random Forest as the base estimator
boosting = AdaBoostClassifier(best_rf, n_estimators=50, random_state=2, algorithm="SAMME")
boosting.fit(X_train, y_train)
boosting_predictions = boosting.predict(X_test)
print("Boosting with AdaBoost Accuracy:", accuracy_score(y_test, boosting_predictions))
# Stacking with the tuned Random Forest and Logistic Regression as base models,
# using a Random Forest classifier as the final estimator
models = [('rf', best_rf), ('lr', best_lr)]
stacking1 = StackingClassifier(estimators=models, final_estimator=RandomForestClassifier())
stacking1.fit(X_train, y_train)
stacking1_predictions = stacking1.predict(X_test)
print("Stacking Accuracy with Random Forest as the Final Estimator:", accuracy_score(y_test, stacking1_predictions))
# using logistic regression as the final estimator
stacking2 = StackingClassifier(estimators=models, final_estimator=LogisticRegression())
stacking2.fit(X_train, y_train)
stacking2_predictions = stacking2.predict(X_test)
print("Stacking Accuracy with Logistic Regression as the Final Estimator:", accuracy_score(y_test, stacking2_predictions))
Boosting with AdaBoost Accuracy: 0.9038083111471235
Stacking Accuracy with Random Forest as the Final Estimator: 0.8437319134159046
Stacking Accuracy with Logistic Regression as the Final Estimator: 0.8958212756106031
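Under the hood, StackingClassifier does roughly the following: out-of-fold predictions from each base model become the input features of the final estimator. A simplified sketch on synthetic data (`make_classification` stands in for the bank features here; this is not the exact scikit-learn implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X_demo, y_demo = make_classification(n_samples=200, random_state=2)

base_models = [RandomForestClassifier(random_state=2),
               LogisticRegression(max_iter=500)]

# Out-of-fold positive-class probabilities, one column per base model
meta_features = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])
print(meta_features.shape)  # (200, 2)

# The final estimator learns how to combine the base models' predictions
final_estimator = LogisticRegression().fit(meta_features, y_demo)
```

Using out-of-fold predictions (rather than in-sample ones) keeps the final estimator from simply memorizing overfit base-model outputs.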
boosting_accuracy = accuracy_score(y_test, boosting_predictions)
boosting_recall = recall_score(y_test, boosting_predictions, average='macro')
boosting_precision = precision_score(y_test, boosting_predictions, average='macro')
boosting_f1 = f1_score(y_test, boosting_predictions, average='macro')
boosting_conf_matrix = confusion_matrix(y_test, boosting_predictions)
boosting_class_report = classification_report(y_test, boosting_predictions, output_dict=True)
boosting_report_df = pd.DataFrame(boosting_class_report).transpose()
print("AdaBoost Using Random Forest Estimator Accuracy:", boosting_accuracy)
print("AdaBoost Using Random Forest Estimator Confusion Matrix:")
display(pd.DataFrame(boosting_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("AdaBoost Using Random Forest Estimator Classification Report:")
display(boosting_report_df)
stacking1_accuracy = accuracy_score(y_test, stacking1_predictions)
stacking1_recall = recall_score(y_test, stacking1_predictions, average='macro')
stacking1_precision = precision_score(y_test, stacking1_predictions, average='macro')
stacking1_f1 = f1_score(y_test, stacking1_predictions, average='macro')
stacking1_conf_matrix = confusion_matrix(y_test, stacking1_predictions)
stacking1_class_report = classification_report(y_test, stacking1_predictions, output_dict=True)
stacking1_report_df = pd.DataFrame(stacking1_class_report).transpose()
print("Stacking Using Random Forest as the Final Estimator Accuracy:", stacking1_accuracy)
print("Stacking Using Random Forest as the Final Estimator Confusion Matrix:")
display(pd.DataFrame(stacking1_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Stacking Using Random Forest as the Final Estimator Classification Report:")
display(stacking1_report_df)
stacking2_accuracy = accuracy_score(y_test, stacking2_predictions)
stacking2_recall = recall_score(y_test, stacking2_predictions, average='macro')
stacking2_precision = precision_score(y_test, stacking2_predictions, average='macro')
stacking2_f1 = f1_score(y_test, stacking2_predictions, average='macro')
stacking2_conf_matrix = confusion_matrix(y_test, stacking2_predictions)
stacking2_class_report = classification_report(y_test, stacking2_predictions, output_dict=True)
stacking2_report_df = pd.DataFrame(stacking2_class_report).transpose()
print("Stacking Using Logistic Regression as the Final Estimator Accuracy:", stacking2_accuracy)
print("Stacking Using Logistic Regression as the Final Estimator Confusion Matrix:")
display(pd.DataFrame(stacking2_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Stacking Using Logistic Regression as the Final Estimator Classification Report:")
display(stacking2_report_df)
AdaBoost Using Random Forest Estimator Accuracy: 0.9038083111471235
AdaBoost Using Random Forest Estimator Confusion Matrix:
| | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 7477 | 219 |
| 1.0 | 583 | 360 |
AdaBoost Using Random Forest Estimator Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.927667 | 0.971544 | 0.949099 | 7696.000000 |
| 1.0 | 0.621762 | 0.381760 | 0.473062 | 943.000000 |
| accuracy | 0.907165 | 0.907165 | 0.907165 | 0.907165 |
| macro avg | 0.774715 | 0.676652 | 0.711080 | 8639.000000 |
| weighted avg | 0.894276 | 0.907165 | 0.897136 | 8639.000000 |
Stacking Using Random Forest as the Final Estimator Accuracy: 0.8437319134159046
Stacking Using Random Forest as the Final Estimator Confusion Matrix:
| | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 6979 | 717 |
| 1.0 | 633 | 310 |
Stacking Using Random Forest as the Final Estimator Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.916842 | 0.906835 | 0.911811 | 7696.000000 |
| 1.0 | 0.301850 | 0.328738 | 0.314721 | 943.000000 |
| accuracy | 0.843732 | 0.843732 | 0.843732 | 0.843732 |
| macro avg | 0.609346 | 0.617786 | 0.613266 | 8639.000000 |
| weighted avg | 0.849712 | 0.843732 | 0.846635 | 8639.000000 |
Stacking Using Logistic Regression as the Final Estimator Accuracy: 0.8958212756106031
Stacking Using Logistic Regression as the Final Estimator Confusion Matrix:
| | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 7542 | 154 |
| 1.0 | 746 | 197 |
Stacking Using Logistic Regression as the Final Estimator Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.909990 | 0.979990 | 0.943694 | 7696.000000 |
| 1.0 | 0.561254 | 0.208908 | 0.304482 | 943.000000 |
| accuracy | 0.895821 | 0.895821 | 0.895821 | 0.895821 |
| macro avg | 0.735622 | 0.594449 | 0.624088 | 8639.000000 |
| weighted avg | 0.871924 | 0.895821 | 0.873920 | 8639.000000 |
results = pd.DataFrame({
'Metric': ['Accuracy', 'Recall', 'Precision', 'F1 Score'],
'Random Forest': [rf_accuracy, rf_recall, rf_precision, rf_f1],
'Logistic Regression': [lr_accuracy, lr_recall, lr_precision, lr_f1],
'Random Forest Using Grid Search Parameters':[rf_accuracy_grid, rf_recall_grid, rf_precision_grid, rf_f1_grid],
'Logistic Regression Grid Search Parameters': [lr_accuracy_grid, lr_recall_grid, lr_precision_grid, lr_f1_grid],
'AdaBoost Using Random Forest Estimator':[boosting_accuracy, boosting_recall, boosting_precision, boosting_f1],
'Stacking Using Random Forest as the Final Estimator':[stacking1_accuracy, stacking1_recall, stacking1_precision, stacking1_f1],
'Stacking Using Logistic Regression as the Final Estimator':[stacking2_accuracy, stacking2_recall, stacking2_precision, stacking2_f1]
})
# Display the table
display(results)
| | Metric | Random Forest | Logistic Regression | Random Forest Using Grid Search Parameters | Logistic Regression Grid Search Parameters | AdaBoost Using Random Forest Estimator | Stacking Using Random Forest as the Final Estimator | Stacking Using Logistic Regression as the Final Estimator |
|---|---|---|---|---|---|---|---|---|
| 0 | Accuracy | 0.907397 | 0.896053 | 0.907165 | 0.895821 | 0.903808 | 0.843732 | 0.895821 |
| 1 | Recall | 0.681434 | 0.592718 | 0.676652 | 0.585609 | 0.607771 | 0.617786 | 0.594449 |
| 2 | Precision | 0.774257 | 0.737800 | 0.774715 | 0.738405 | 0.793805 | 0.609346 | 0.735622 |
| 3 | F1 Score | 0.714844 | 0.622238 | 0.711080 | 0.613633 | 0.645077 | 0.613266 | 0.624088 |
y_test
array([1., 0., 0., ..., 0., 0., 1.])
ROC Curves
# Since this is a binary classification problem, each model's ROC curve is
# computed from its predicted probability for the positive class (column 1)
fpr1 , tpr1, thresholds1 = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
fpr2 , tpr2, thresholds2 = roc_curve(y_test, lr.predict_proba(X_test)[:, 1])
fpr3 , tpr3, thresholds3 = roc_curve(y_test, best_rf.predict_proba(X_test)[:, 1])
fpr4 , tpr4, thresholds4 = roc_curve(y_test, best_lr.predict_proba(X_test)[:, 1])
fpr5 , tpr5, thresholds5 = roc_curve(y_test, boosting.predict_proba(X_test)[:, 1])
fpr6 , tpr6, thresholds6 = roc_curve(y_test, stacking1.predict_proba(X_test)[:, 1])
fpr7 , tpr7, thresholds7 = roc_curve(y_test, stacking2.predict_proba(X_test)[:, 1])
plt.plot([0,1],[0,1], 'k--')
plt.plot(fpr1, tpr1, label= "Random Forest")
plt.plot(fpr2, tpr2, label= "Logistic Regression")
plt.plot(fpr3, tpr3, label= "Random Forest After Grid Search")
plt.plot(fpr4, tpr4, label= "Logistic Regression After Grid Search")
plt.plot(fpr5, tpr5, label= "Boosting")
plt.plot(fpr6, tpr6, label= "Stacking1")
plt.plot(fpr7, tpr7, label= "Stacking2")
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title('Receiver Operating Characteristic')
plt.show()
Accuracy Comparison
accuracies = [
rf_accuracy,
lr_accuracy,
rf_accuracy_grid,
lr_accuracy_grid,
boosting_accuracy,
stacking1_accuracy,
stacking2_accuracy
]
# for labeling
model_names = [
'Random Forest',
'Logistic Regression',
'Random Forest After Grid Search',
'Logistic Regression After Grid Search',
'Boosting',
'Stacking1',
'Stacking2'
]
# Create an accuracy bar chart
plt.figure(figsize=(10, 6))
plt.bar(model_names, accuracies, color='skyblue')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Accuracy Comparison of Models')
plt.xticks(rotation=45)
plt.ylim([min(accuracies) - 0.05, max(accuracies) + 0.05])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
accuracies_df = {"Model Name":model_names,
"Accuracy Value":accuracies}
display(pd.DataFrame(accuracies_df))
| | Model Name | Accuracy Value |
|---|---|---|
| 0 | Random Forest | 0.907397 |
| 1 | Logistic Regression | 0.896053 |
| 2 | Random Forest After Grid Search | 0.907165 |
| 3 | Logistic Regression After Grid Search | 0.895821 |
| 4 | Boosting | 0.903808 |
| 5 | Stacking1 | 0.843732 |
| 6 | Stacking2 | 0.895821 |